{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DB Fixes\n",
    "\n",
    "Back up your database first!\n",
    "\n",
    "After inserting entries in database, we needed some minor fixes to clean up data.\n",
    "- remove slashes from identifier entries\n",
    "- find where there are duplicate entries for a given arXiv identifier and remove first entries\n",
    "- check which images have no matching metadata\n",
    "- automatically download arXiv OAI data and add it into the database\n",
    "- search arXiv metadata JSON file\n",
    "\n",
    "There is also code for replacing a table from another database.\n",
    "\n",
    "NB: could be bugs in this and might require manual setting up depending on dataset and metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import xml.etree.ElementTree as ET\n",
    "import sqlite3\n",
    "import requests\n",
    "import time\n",
    "import json"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Connect to database"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "db_path = os.path.expanduser(\"~/data/db/arxiv_db_images.sqlite3\")\n",
    "db = sqlite3.connect(db_path)\n",
    "c = db.cursor()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check that we can access the table pragma info"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c.execute('PRAGMA TABLE_INFO({})'.format(\"metadata\"))\n",
    "info = c.fetchall()\n",
    "\n",
    "print(\"\\nColumn Info:\\nID, Name, Type, NotNull, DefaultVal, PrimaryKey\")\n",
    "for col in info:\n",
    "    print(col)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Remove slashes\n",
    "\n",
    "We also searched the SQLite database for any entries with special characters such as '/' that caused errors in the file insertion. In each arXiv category pre-2007, we removed the forward slash as this could potentially cause problems in how this data is read (this can be added back in as required).\n",
    "\n",
    "This is done via SQLite command to replace the slash character with an empty string ('')."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### !!! MAKE SURE TO BACK UP DATABASE FIRST !!!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sql = (\"UPDATE metadata SET identifier=REPLACE(identifier, '/', '') WHERE identifier LIKE '%/%'\")\n",
    "c.execute(sql)\n",
    "db.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Duplicate entries\n",
    "\n",
    "There are a number of entries in the metadata table that are duplicates. This seems to be due to the way that arXiv maintains OAI metadata when there is a version update to a given article, or it could point to the way that `metha` downloads the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# look for duplicate rows in the metadata table\n",
    "c.execute('''\n",
    "    SELECT identifier, COUNT(identifier) c\n",
    "    FROM metadata\n",
    "    GROUP BY identifier\n",
    "    HAVING c > 1\n",
    "    ''')\n",
    "rows = c.fetchall()\n",
    "print(\"number of rows:\",len(rows))\n",
    "for row in rows:\n",
    "    print(row)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "duplicate_ids = [r[0] for r in rows]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to have a look at the duplicate entries, do the below to print out and check if the relevant fields are the same"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "duplicates = rows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sql = ('''\n",
    "    SELECT id, identifier, created, cat, authors, title\n",
    "    FROM metadata\n",
    "    WHERE identifier IS ?\n",
    "''')\n",
    "\n",
    "for d in duplicates[:]:\n",
    "    c.execute(sql, (d[0], ))\n",
    "    rows = c.fetchall()\n",
    "    \n",
    "    same = True\n",
    "    r1 = rows[0][1:]\n",
    "    for row in rows[1:]:\n",
    "        if r1 != row[1:]:\n",
    "            same = False\n",
    "    if same is False:\n",
    "        print(\"-----\",d[0])\n",
    "        for row in rows:\n",
    "            print(row)\n",
    "        print(\"----- !!! mismatch !!! -----\")\n",
    "#     else:\n",
    "#         print(\"*** entry identitical ***\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Remove earlier entries for any that have duplicates\n",
    "\n",
    "This is quick as long as there has been an index created on metadata identifiers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "count = 0\n",
    "sql = ('''\n",
    "    SELECT id, identifier\n",
    "    FROM metadata\n",
    "    WHERE identifier = ?\n",
    "    ''')\n",
    "for dup_id in duplicate_ids[:]:\n",
    "    # first grab the ids of all rows in metadata table for this entry with duplicates\n",
    "    c.execute(sql, (dup_id, ))\n",
    "    rows = c.fetchall()\n",
    "    print(\"number of duplicates for dup_id:\",len(rows))\n",
    "    for row in rows:\n",
    "        print(row)\n",
    "        \n",
    "    print(\"deleting n-1 rows\")\n",
    "    for row in rows[:-1]:\n",
    "        print(row)\n",
    "        delete_sql = (\"DELETE FROM metadata WHERE metadata.id = ?\")\n",
    "        c.execute(delete_sql, (row[0], ))\n",
    "        count += 1\n",
    "#         test = c.fetchall()\n",
    "#         print(test)\n",
    "print(\"finished deleting, removed {} entries\".format(count))\n",
    "db.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Searching JSON\n",
    "\n",
    "JSON file provided by Kaggle - see https://www.kaggle.com/Cornell-University/arxiv\n",
    "This is a single large JSON file with all metadata. Useful to double check some things as this has been cleaned up and is used more widely. Note: reading the data in will take a few minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = []\n",
    "filepath = os.path.expanduser(\"~/data/metadata/arxiv-metadata-oai-snapshot-2020-08-14.json\")\n",
    "with open(filepath) as json_file:\n",
    "    for line in json_file:\n",
    "        data.append(json.loads(line))\n",
    "#     jdata = json.loads(json_file.read())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def search_json(data, name):\n",
    "    results = []\n",
    "    for dict in data:\n",
    "        if dict[\"id\"] == name:\n",
    "            results.append(dict)\n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(search_json(data, \"0704.0342\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(search_json(data, rows[1][0]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finding missing metadata\n",
    "\n",
    "Find where there is no matching metadata for a given image by comparing identifiers. Print and also keep this list in the `rows` array for next step."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c.execute('''\n",
    "    SELECT count(images.identifier), images.identifier\n",
    "    FROM images\n",
    "    LEFT JOIN metadata ON metadata.identifier = images.identifier\n",
    "    WHERE metadata.identifier IS NULL\n",
    "    GROUP BY images.identifier\n",
    "    ''')\n",
    "rows = c.fetchall()\n",
    "\n",
    "print(\"total number of identifiers:\", len(rows))\n",
    "print(\"total number of images:\", sum(row[0] for row in rows))\n",
    "for row in rows:\n",
    "    print(row)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create a list with the identifiers only"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "missing_identifiers = [x[1] for x in rows]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(len(missing_identifiers))\n",
    "print(missing_identifiers)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Download the XML files for each of the identifiers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for missing_id in missing_identifiers:\n",
    "    \n",
    "    URL = \"http://export.arxiv.org/oai2?verb=GetRecord&identifier=oai:arXiv.org:{}&metadataPrefix=arXiv\".format(missing_id)\n",
    "    filepath = \"extra-xml/{}.xml\".format(missing_id)\n",
    "    if os.path.isfile(filepath) is False:\n",
    "        print(\"xml file doesn't exist\")\n",
    "        response = requests.get(URL)\n",
    "        with open(filepath, 'wb') as file:\n",
    "            file.write(response.content)\n",
    "        time.sleep(2)\n",
    "    else:\n",
    "        print(\"xml file already exists, skipping\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This final identifier was done manually, only one old identifier\n",
    "(it doesn't work because it needs a slash added to the url, this was removed in our database)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Iterate over all identifiers/XML files and grab key data, insert into SQLite"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for missing_id in missing_identifiers:\n",
    "    \n",
    "    filepath = \"extra-xml/{}.xml\".format(missing_id)\n",
    "\n",
    "    data = ET.parse(filepath)\n",
    "    root = data.getroot()\n",
    "    print(\"opening file: \" + filepath)\n",
    "\n",
    "    OAI = \"{http://www.openarchives.org/OAI/2.0/}\"\n",
    "    ARXIV = \"{http://arxiv.org/OAI/arXiv/}\"\n",
    "\n",
    "    try:\n",
    "        # get the correct nodes from XML\n",
    "        request_node = root.find(OAI+'GetRecord')\n",
    "        print(request_node)\n",
    "        record_node = request_node.find(OAI+'record')\n",
    "        print(record_node)\n",
    "        metadata_node = record_node.find(OAI+'metadata')\n",
    "        print(metadata_node)\n",
    "        arXiv_node = metadata_node.find(ARXIV+'arXiv')\n",
    "        print(arXiv_node)\n",
    "        id_node = arXiv_node.find(ARXIV+'id')\n",
    "        print(id_node)\n",
    "        identifier = id_node.text\n",
    "        print(identifier)\n",
    "        # id_node = root.find(OAI+'id')\n",
    "\n",
    "        date = arXiv_node.find(ARXIV+'created').text\n",
    "        categories = arXiv_node.find(ARXIV+'categories').text\n",
    "        title = arXiv_node.find(ARXIV+'title').text\n",
    "        abstract = arXiv_node.find(ARXIV+'abstract').text\n",
    "\n",
    "        license_node = arXiv_node.find(ARXIV+\"license\")\n",
    "\n",
    "        if license_node is not None:\n",
    "            lic = license_node.text\n",
    "            # print(\"license_node not None\")\n",
    "        else:\n",
    "            lic = \"\"\n",
    "\n",
    "        # attempt to get authors\n",
    "        authors_list = []\n",
    "        authors_element = arXiv_node.find(ARXIV+\"authors\")\n",
    "\n",
    "        # create a (string) variable to store all authors names\n",
    "        # for now, this just writes the whole list of authors as a string\n",
    "        anames = \"\"\n",
    "\n",
    "        for author in authors_element:\n",
    "            # print(author.find(ARXIV+\"keyname\").text)\n",
    "            aname = \"\"\n",
    "            kn = author.find(ARXIV+\"keyname\").text\n",
    "            fn_node = author.find(ARXIV+\"forenames\")\n",
    "            if fn_node is not None:\n",
    "                fn = fn_node.text\n",
    "            else:\n",
    "                fn = \"\"\n",
    "            aname =  kn + \", \" + fn + \"; \"\n",
    "\n",
    "            anames += aname\n",
    "\n",
    "        authors_list.append(anames)\n",
    "\n",
    "        # convert to string and remove extra characters\n",
    "        # authors = \"\" + str(authors)\n",
    "        authors = (str)(authors_list)[2:-4]\n",
    "\n",
    "        print(\"-\" * 20)\n",
    "        print(identifier)\n",
    "        print(date)\n",
    "        print(categories)\n",
    "        print(authors)\n",
    "        print(title)\n",
    "        print(abstract)\n",
    "        print(lic) # don't use license as it is reserved for Python!\n",
    "        print(\"-\" * 20)\n",
    "\n",
    "        c.execute(\"INSERT INTO metadata (identifier, created, cat, authors, title, abstract, licence) \\\n",
    "        VALUES (?, ?, ?, ?, ?, ?, ?)\", \\\n",
    "        (identifier, date, categories, authors, title, abstract, lic))\n",
    "\n",
    "    except KeyboardInterrupt:\n",
    "        db.commit()\n",
    "\n",
    "        # quit\n",
    "        sys.exit()\n",
    "    # except AttributeError as error:\n",
    "        # print(error)\n",
    "        # continue\n",
    "    except Exception as e:\n",
    "        raise e\n",
    "\n",
    "# finally commit the changes\n",
    "db.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### NB: We needed to run the code to remove slashes again after importing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Replace a table from another database\n",
    "\n",
    "This is used if a table has been recreated elsewhere and we would like to replace in the original database."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c.execute(f'ATTACH {db_path} as arxiv')\n",
    "c.execute(\"DROP TABLE metadata\")\n",
    "print(\"dropped table, vacuuming\")\n",
    "c.execute(\"VACUUM\")\n",
    "print(\"vacuuming done, creating new table and inserting data\")\n",
    "c.execute(\"CREATE TABLE metadata(id INTEGER PRIMARY KEY, identifier TEXT, created TEXT, cat TEXT, authors TEXT, title TEXT, abstract TEXT, licence TEXT)\")\n",
    "c.execute(\"INSERT INTO main.metadata SELECT * FROM arxiv.metadata\")\n",
    "print(\"done, committing\")\n",
    "db.commit()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And finally,"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "c.close()\n",
    "db.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
